Multi-class sentiment analysis of urdu text using multilingual BERT

Sentiment analysis (SA) is an important task because of its vital role in analyzing people’s opinions. However, existing research is solely based on the English language with limited work on low-resource languages. This study introduced a new multi-class Urdu dataset based on user reviews for sentiment analysis. This dataset is gathered from various domains such as food and beverages, movies and plays, software and apps, politics, and sports. Our proposed dataset contains 9312 reviews manually annotated by human experts into three classes: positive, negative and neutral. The main goal of this research study is to create a manually annotated dataset for Urdu sentiment analysis and to set baseline results using rule-based, machine learning (SVM, NB, Adabbost, MLP, LR and RF) and deep learning (CNN-1D, LSTM, Bi-LSTM, GRU and Bi-GRU) techniques. Additionally, we fine-tuned Multilingual BERT(mBERT) for Urdu sentiment analysis. We used four text representations: word n-grams, char n-grams,pre-trained fastText and BERT word embeddings to train our classifiers. We trained these models on two different datasets for evaluation purposes. Finding shows that the proposed mBERT model with BERT pre-trained word embeddings outperformed deep learning, machine learning and rule-based classifiers and achieved an F1 score of 81.49%.

• Is it possible to utilize a deep learning model in combination with a pre-trained word embedding strategy to identify the sentiment expressed by a social network user in Urdu? • Does the deep learning approach with fastText and BERT word embedding effective than the machine learning-based approaches and the rule-based approach to sentiment analysis for the Urdu language that have been studied so far?
To answer the first study question, the use of pre-trained word embeddings for sentiment analysis of Urdu language reviews is investigated. A deep learning model based on pre-trained word embedding captures long-term semantic relationships between words, unlike rule-based and machine learning-based approaches. To answer the second question, the deep learning models were compared to the machine learning-based methods and the rule-based method of Urdu sentiment analysis. The main contribution of our research are as follows: • A new Multi-class sentiment analysis dataset for Urdu language based on user reviews. It is gathered from various domains such as food and beverages, movies and plays, software and apps, politics and sports. To the best of our knowledge, no such public Urdu corpus exists. The corpus will be made publicly available. • Fine-tuning a multilingual BERT model for Urdu sentiment classification, which has been trained on 104 languages, including Urdu, and is based on a BERT base with 12 layers, 768 hidden heads, and 110M parameters. • A set of baseline results of rule-based approach, machine-learning models (LR, MLP, Ada-Boost, RF, SVM) and deep learning models (1D-CNN, LSTM, Bi-LSTM, GRU and Bi-GRU) to create a benchmark for multiclass sentiment analysis using different text representations: fastText pre-trained word embeddings, char n-gram and word n-gram features.
The rest of the paper is organized as follows. Section "Related work" explains the related work for sentiment analysis. Section "Corpus generation" describes the creation of dataset and its statistics. Section "Proposed methodology" presents the proposed methodology. Section "Results analysis" analyze the experimental results and evaluation measures. Section "Conclusion and implications" concludes the paper.

Related work
In this section, we give a quick overview of existing datasets and popular techniques for sentiment analysis. Sentiment analysis datasets. SemEval challenges are the most prominent efforts taken in the existing literature to create standard datasets for SA. In each competition, scholars accomplish different tasks to examine semantic analysis classifications using different corpora. The outcome of such competitions is a group of standard datasets and diverse approaches for SA. These benchmark corpora have been created in the English and Arabic languages 31 . Mainly, user tweets/reviews belong to various genres such as hotel, restaurants and laptops.
Every time, the SemEval contests series comes up with the various size of corpora. In the 2013 edition, the SemEval competition used SMS and Twitter corpora, and the Twitter corpus contains a total of 15,195 reviews, was split into training, development, and testing data are 9728, 1654, and 3813, respectively, while the SMS corpus consists of 2093 reviews was only used for testing purpose. The Twitter corpus comprises a total of 1853 reviews in the 2014 edition, including 86 sarcastic tweets for testing 32 . There were five separate subtasks in the 2016 and 2017 competition series. Each task's corpus was divided into three sections: training, development, and testing. Subtask A, B, and D and subtask C and E sentences 30,632, 17,639, and 30,632 were used, respectively. There are 332 news articles in the Korean corpus for SA. Human experts manually annotated these news articles for sentiment analysis. The dataset contains 7713 subjectively annotated sentences and 17,615 opinionated expression tags utilizing the Korean Subjectivity Markup Language annotation method, reflecting the characteristics of Korean languages 33 .
Another corpus has been created in the Indonesian language. The Twitter streaming API was used to collect 3.5 million tweets 34 . A Roman Urdu corpus has been created, contains 10,021 user comments belonging Methods for sentiment analysis. Several methods have been proposed in the existing literature to solve SA tasks, such as supervised and unsupervised machine learning. In SemEval 2014 competition, both Support Vector Machine (SVM) and rule-based machine learning methods were applied. The lexicons were utilized to find the sentiment polarities of reviews using the rule-based technique. The overall polarity of the review was computed by summing the polarity scores of all words in the review and dividing by their distance from the aspect term. If a sentence's polarity score is less than zero (0), it is classified as negative; if the score is equal to zero, it is defined as neutral; and if the score is equal to or more than one, it is defined as positive. These classified features and n-gram features have been used to train machine learning algorithms. In SemEval 2016 contest edition, many machine learning algorithms such as Linear Regression (LR), Random Forest (RF), and Gaussian Regression (GR) were used 31 . The word embeddings are enhanced Natural Language Processing (NLP) method representing words or phrases into numerical numbers names as vector. Machine learning algorithms such as SVM will determine a hyperplane that classifies tweets/reviews according to their sentiment. Similarly, RF generates various decision trees, and each tree is examined before a final choice is made. In the same way, Nave Bayes (NB) is a probabilistic machine learning method that is based on the Bayes theorem 36 .
Many research studies have been published to execute SA of various resource-deprived dialects like as Khmer, Thai, Roman Urdu, Arabic and Hindi. Based on the negation and discourse relationship, a study on Hindi dialect has been conducted for sentiment analysis. A corpus of human-annotated reviews in Hindi was created. An accuracy of 80.21% was achieved using a polarity-based method 37 . Similarly, few research studies have been conducted in the Thai dialect, also considered resource-deprived languages 38 . Another study was carried out to identify abusive words in the Thai dialect. Eighty-six percent of the f-measure was attained using the machine learning method. Similarly, a research study has been conducted in the Bengali dialect 39 . In this study, the SA of Bengali reviews is executed using the word2vec embedding model. Results reveal that their proposed algorithm achieved an accuracy of 75.5%.

Urdu datasets and machine learning techniques. The essential component of any sentiment analysis
solution is a computer-readable benchmark corpus of consumer reviews. One of the most significant roadblocks for Urdu SA is a lack of resources, such as the lack of a gold-standard dataset of Urdu reviews. The truth is that most Urdu websites are designed in illustrative patterns rather than using standard Urdu encoding 40 . We recognized two methods for dataset creation from the existing literature, named as (1) automatic and (2) manual.
A research study focusing on Urdu sentiment analysis 41 created two datasets of user reviews to examine the efficiency of the proposed model. Only 650 movie reviews are included in the C1 dataset, with each review averaging 264 words in length. There are 322 positive and 328 negative reviews in corpus C1. The other dataset named C2, contains 700 reviews about refrigerators, air conditions, and televisions. The average length of words per review is 196 words.
Another study 42 used a corpus collected from the BBC Urdu news website to work on Urdu text classification. Two types of filters were successfully implemented to collect the required data. They concentrate on words like "Ghusa" (anger) and "Pyar" (love). A HTML parser is used to parse the obtained data, which yielded 500 news stories with 700 sentences containing the keywords mentioned above. These sentences were annotated for emotions. Nearly 6000 sentences not annotated with emotions were discarded from those 500 news articles.
Another study 43 on Urdu sentiment analysis subjectivity developed a corpus consisting of 6025 sentences from151 Urdu blogs from 14 various domains. Three human specialists manually classified these comments into three categories: neutral, negative, and positive. Additionally, they have implemented five supervised machine learning algorithms like SVM, Lib, NB (KNN, IBK), PART, and decision tree. Results reveal that KNN achieves the highest accuracy of 67.01% and performs better than other supervised machine learning algorithms. However, the performance of models can be enhanced by increasing the corpus size and using deep learning methods with pre-trained word embedding models.
Similarly, in work 44 , the comparison of NB versus SVM for the language preprocessing steps of Urdu documents reveals that SVM performs better than NB regarding accuracy. Additionally, normalized term frequency gives much improved results for feature selection. The major drawback of the proposed system is that the tokenization is done based on punctuation marks and white spaces. However, due to the grammatical structure of the Urdu language, the writer may put white space between a single word such as (Khoubsorat, beautiful), which will cause the tokenizer to tokenize the single word as two words (khoub) and (sorat), which is incorrect.
According to this study 45 , authors used three classic machine learning algorithms, such as NB, SVM, and Decision tree followed by a supervised machine learning approach to create Word Sense Disambiguation (WSD) in Urdu text. They test their theories using a corpus generated from Urdu news websites. They attain an f-measure of 0.71%. However, by implanting an adaptive mechanism, the system's accuracy could be increased.
Urdu datasets and deep learning techniques. Deep learning approaches have recently been investigated for classification of Urdu text. In this study 46 , authors used deep learning methods to classify Urdu documents for product manufacturing. Stop words and infrequent words were deleted, which increased performance for medium and small datasets but decreased performance for large corpora. According to their findings, CNN with several filters (3,4,5) outperformed the competition, whereas BiLSTM outperformed CLSTM and LSTM. The authors of 47 used a single layer CNN with several filters to classify documents at the document level, and the results outperformed the baseline approaches. For document classification 48  www.nature.com/scientificreports/ of hybrid, machine learning, and deep learning models. According to their findings, the normalized difference measure-based feature selection strategy increases the accuracies of all models. In this study 49 , authors recently suggested a model for Urdu SA by examining deep learning methods along with various word embeddings. For sentiment analysis, the effectiveness of deep learning algorithms such as LSTM, BiLSTM-ATT, CNN, and CNN-LSTM was evaluated.
The most significant work 50 has recently been performed on SA of Urdu text using various machine learning and deep learning techniques. Initially, Urdu user reviews of six various domains were collected from various social media platforms to build a state of art corpus. Later on, the whole Urdu corpus was manually annotated by human experts. Finally, a set of machine learning algorithms such as RF, NB, SVM, AdaBoost, MLP, LR, and deep learning algorithms such LSTM and CNN-1D were applied to validate the generated Urdu corpus. LR algorithms achieve the highest accuracy out of all others machine learning and deep learning algorithms.
A few research employing deep learning, semantic graphs and multimodal based system (MBS) have been undertaken on the areas of emotion classification 51 , concept extraction 52 , and user behavior analysis 53 . A unique CNN Text word2vec model was proposed in the research study 51 to analyze emotion in microblog texts. According to the testing results the suggested MBS 52 has a remarkable ability to learn the normal pattern of users' everyday activities and detect anomalous behaviors.
There have been very few research studies on Urdu SA, and it is still in its early stages of maturation compared to other resource-rich languages like English. Because of the scarcity of linguistic resources, this can be discouraging for language engineering scholars. The majority of previous research papers 47 focused on various areas of language processing such as stemming, stop word recognition and removal, and Urdu word segmentation and normalization. The summery of the existing literature is presented in Table 1.
Furthermore, the size of available annotated datasets is insufficient for successful sentiment analysis. However, the majority of the datasets and reviews from limited domains are only from negative and positive classes. To address this issue, this work focuses on the creation of an Urdu text corpus that includes sentences from several genres. To accomplish sentiment analysis task, we have applied various machine learning models with various features, deep learning models with combination of pre-trained word vectors and a rule-based algorithm on our created corpus UCSA-21 which has not yet investigated completely for the Urdu sentiment analysis text.

Corpus generation
This section explains how a manually annotated Urdu dataset was created to achieve Urdu SA. The collection of user comments and reviews from multiple websites, the compilation of human annotation rules, the execution of manual annotation, standardization, and finally, the description of the dataset's features are all phases involved in creating the Urdu Corpus for Sentiment Analysis (UCSA-21).
We gathered data from websites that offered unfettered access and allowed users to remark in Urdu to create a benchmark dataset for assessing Urdu sentiment. Table 2 summarizes all of the websites that we visited to get user reviews. Movies, Pakistani and Indian drama, TV discussion shows, food and recipes, politicians and www.nature.com/scientificreports/ Pakistani political parties, sport, software, blogs and forums and gadgets were among the genres from which we gathered data. During a 5-to 6-month period, three people who were well-versed in the objective manually collected user comments. Initially, the data was gathered into an excel sheet along with the following details: (1) the review ID; (2) the review's domain; and (3) the annotation label.
To implement Urdu SA, we need an annotated corpus containing user comments with their sentiments. Initially, annotations rules were defined then the corpus was annotated manually by three native speakers of the Urdu language keeping in mind those guidelines. All three native Urdu speakers were well aware of the purpose of annotation, annotated the complete dataset. Annotations guidelines were made for Urdu SA from existing literature. Figure 1 shows some samples of comments from the neutral, negative, and positive categories.

Annotation rules.
• A review is considered positive if the specified review expresses a positive meaning for all the characteristic terms. Suppose it contains words such as "acha" good, "Khoubsoorat" beautiful without containing negations like "Na" "Nahi" no as these words change the polarity 55 . • If any review expressing mutually neutral and positive classes, the review is marked as positive.
• If any review expressing any agreement, then that review is classified as positive 56 .
• If the user review expresses the negative sentiment in all aspects, then the review is marked as negative if it contains terms like "Bora" bad, "bukwas" rubbish, "zolum" cruelness, "ganda" dirty, without containing the negations as negations invert the polarity of the whole sentence 57 . • If a user comment comprises more negative words than any other class, it is classified as a negative review.
• If a sentence contained straight unsoftened disagreements, then that sentence is classified as negative 56 .
• If a review contained words such as banning, penalizing, assessing, and bidding, then that review is marked as a negative review 56 . • If a review comprises a denial, then that review is tagged as a negative review.
• If a review contains a negative term with a positive adjective, then that sentence is marked as a negative review 58 . • Mockery: sentence such as "MashaAllah se koy to rank milli ha na hamari cricket team ko ...akhiri he sahi" (By the grace of God, our cricket team got at least some rank. may that be last) as classified as negative sentences 59 . • If a sentence contains a question such as "eis team ka kia banay ga" what will happen to this team? Showing frustrations is marked as a negative review 59 . • If a piece of factual information is presented in a sentence, then the sentence is marked as a neutral sentence?.
• If assumptions, beliefs, or thoughts are shared in a review, then that review is identified as a neutral sentence 60 .
• If words like maybe (Shaid) are present in a review, they are classified as neutral 56 . Corpus characteristics. To create the standard corpora, three human experts annotated the whole UCSA-21 dataset. Master graduates annotated each user review; they are native Urdu speakers and are well familiar with SA. To ensure that our annotation guidelines were proper, we gave a random sample of 100 reviews to two annotators (X and Y) and asked them to mark and mention which ones came under which conditions. Individualistically, both annotators classified these sentences into one of three categories: negative, neutral, and positive. The conflicting reviews among annotator x and annotator y were resolved by third annotator z keeping in mind the above-discussed annotations guidelines. For the entire dataset, we achieved an Inter-Annotator Agreement (IAA) of 71.45 percent using Cohens Kappa method. The findings of the IAA score and moderate scores show that the manual annotations rules were adequately drafted, well understood, and followed by annotation specialists during the annotation stage. After evaluating the data, it was shown that the majority of the disagreement occurred between the negative and neutral (11.60%) and positive and neutral (12.01%) classifications. Summary of the corpus presented in Table 3 and 4, the UCSA-21 corpus comprises 9312 Urdu reviews, with 3,422 positive ratings, 2787 negative reviews, and 3103 neutral reviews. The statistics of corpus UCSA-21 show a class balance. Academics have worked hard to create datasets for sentiment analysis studies. Still, most of the available annotated datasets are too small and contain sentences from only a few domains, rather than multiple domains like UCSA-21. The other drawback of most of the existing corpora is they contain only two classes, negative and positive.

Proposed methodology
This section contains the experimental description of applied machine learning, rule-based, deep learning algorithms and our proposed two-layer stacked Bi-LSTM model. These algorithms have been trained and tested on our proposed UCSA-21 corpus and UCSA 50 datasets which are publically available.
Experimental datasets. In this research study, we used two urdu datasets UCSA-21(Our Proposed) and UCSA 50 to validate our proposed model. The proposed UCSA-21 dataset contains 9,312 Urdu reviews belonging to various genres such as food and recipes, movies, dramas, TV talk shows, politics, software and gadgets, and sports gathered from different social media websites. Each review in UCSA-21 belongs to one of three classes: neutral represented by 0, positive symbolized by 1, and negative reviews represented by 2. Tertiary classifications have experimented on the proposed corpus. The UCSA corpus compromises with total 9601 positive and negative user comments, contains 4843 positive and 4758 negative reviews. Tables 3 and 4 summarized the details of the used datasets in experiments.
Pre-processing. The primary goal of pre-processing is to prepare input text for subsequent tasks using various steps such as spelling correction, Urdu text cleaning, tokenization, Urdu word segmentation, normalization of Urdu text, and stop word removal. Tokenization is the process of separating each Uni-gram from sentences. The text is tokenized based on punctuation marks and white spaces. Stop words are vital words of any dialect and have no means in the context of sentiment classifications. They all are removed from the corpus to minimize corpus size. Segmentation is the method to find the boundaries among Urdu words. Due to the morphological structure of the Urdu language, the space between words does not specify a word boundary. Therefore, determining word boundaries in Urdu is essential 41 . Space-omission and Space-insertion are two main issues are linked with Urdu word segmentation. An example of a space omission among two words such as "Alamgeir", universal and similarly space insertion in a single word such as "Khoub Sorat", beautiful. In Urdu dialect, many words contain more than one string, such as "Khosh bash, " which means happiness is a Uni-gram with two strings. If during typing, that space between two strings is somehow omitted, then it will become "Khoshbash, " which is wrong syntactically and semantically either.The normalization part can be applied to fix the problem of correct encodings for the Arabic and Urdu characters with appropriate characters. Normalization brings each character in the designated uni-code array (0600-06FF) for the Urdu dialect. www.nature.com/scientificreports/ Features extraction. Text is often indicated as a vector of weighted features in NLP tasks such as text classification. Different n-gram models are utilized in this study; these are models that assign probability to a series of words.A unigram is a model that has a series of one word, such as "Natural"; similarly, a bigram is a sequence of two words, such as "Natural Language, " and a trigram model is a sequence of three words, such as "Natural Language Processing. " On our dataset, we looked at n-gram features like unigram, bigram, trigram and variouse combination of these n-gram features. Additionally, we also investigate various character gram feattures to gain best results. Recently, pre-trained word embeddings approaches 61 have experimented with several NLP-related tasks, outperforming the existing systems. The main idea behind these word embedding models is to train them on large amounts of text data and fine-tune them for specific applications. The Wikipedia and Common Crawl (CC) data were used to train the fastText word embedding model. Wikipedia is the biggest free online data source, written in more than 200 dialects. After downloading and cleaning data, the model was trained. CC is a non-profit organization, which crawls web data and makes data freely available. fastText has been trained to understand more than 150 dialects, including Urdu. This is why we choose to use the fastText word vector model in our proposed research. fastText word to vector model was trained using Skipgram 61 Figure 2 explains the abstract-level framework from data collection to classification.
The rule-based approach. Pure Urdu lexicon list containing 4728 negative and 2607 positive opinion words are publicly available. Figure 3 explains the algorithm of this approach in detail. Initially, each sentence is tokenized, and then each token is classified into one of three classes by comparing it to the available opinion words in the Urdu lexicon. The accessible Urdu lexicon and the words are used to determine the overall sentiment of the user review. If the text contains more positive tokens, the review is categorized as positive with a polarity score of 1. A review is characterized as negative with a polarity score of 2 if it contains more negative tokens (words) than positive tokens (words). Finally, a review is defined as neutral with a polarity score of 0 if it contains the same number of negative and positive words.
Deep learning models. The deep learning methods such CNN-1D, LSTM, GRU, BI-GRU, Bi-LSTM and mBERT model with word embedding model (fastText) were implemented using keras neural network library 4 for Urdu sentiment analysis to validate our proposed corpus. The technical and experimental information of deep learning algorithms are presented in this section. CNN-1D is mostly utilized in computer vision, but it also excels at classification problems in the natural language processing field. A CNN-1D is particularly capable If you intend to obtain new attributes from brief fixed-length chunks of the entire data set and the position of the feature is irrelevant 62,63 . Study 64 introduced GRU to overcome the shortcomings of recurrent neural networks, such as resolving the vanishing gradient problem using update and reset gate mechanisms.Both update and reset gates are essentially vectors that govern what information should be transmitted to the output unit. The most exciting aspect of GRU is that it can be properly trained to keep information for an extended period of time without losing track of timestamps. A sequence processing model with two GRUs is known as Bi-GRU. One takes information in a forward direction, whereas the other takes it backwards. Only the input and forget gates are present in this bidirectional recurrent neural network. LSTM 65 is a recurrent neural network design that displays state-of-the-art sequential data findings. LSTM is a technique for capturing long-term dependencies between text data. The LSTM model acquires the current word's input for each time step, and the prior or last word's output creates an output, which is utilized to feed to the next state. The prior state's hidden layer (and, in some cases, all hidden layers) is then used for classification. We use Bi-LSTM model to classify each comment according to its class. Generally, Bi-LSTM used to capture more contextual information from both previous and future time sequences. In this study we used two-layer (Forward and Backward) Bi-LSTM, which obtain word embeddings from FastText.
mBERT:BERT 66 is one of the most widely used current language modeling architectures. Its generalization capabilities allows it to be modified to a variety of downstream tasks based on the demands of the user, whether it's NER or relation extraction, question answering, or sentiment analysis. Figure 4 shows high level architecture of our Proposed model based on Multilingual BERT 67 . We fine-tune the latest multilingual (mBERT) model for Urdu sentiment recognition using supervised training data. The model mBERT developed based on single language base BERT 66 , which consists of 12 transformer layers and 768 hidden layers. The top 104 languages including Urdu with the largest Wikipedias were used to train the mBERT model. The training data for every dialect was gathered from a complete Wikipedia dump (except user and talk pages).
Transformers: The BERT small or base has 12 transformer layers, whereas the BERT large has 24 transformer layers. The Transformer is a natural language processing paradigm that aims to do sequence-to-sequence activities with long-range dependencies. The transformers made up with encoders and decoders. Furthermore, an encoder is made up of two pieces. Multi-Head Attention is the first part, while Feed Forward Neural Network www.nature.com/scientificreports/  www.nature.com/scientificreports/ where Q, K, and V are abstract vectors that extract various components from an input word. The special classification token <CLS> in our proposed mBERT model captures the entire sentence, e.g., "Ye tou......" into a fixed-dimensional pooling representation and which produced an output vector with the equal size as the hidden size and the transformers' output then fed into the fully-connected classification layer, which is the first token's ultimate hidden state, whereas the special classification token <SEP> indicates the end of this particular sentence, as illustrated in Fig. 4. The second stage is to replace 15% of tokens in each sentence with a [MASK] token (for example, the word 'Porana' is substituted with a [MASK] token). The context of non-masked tokens is then used by the mBERT model to infer the original values of masked tokens. The encoders assign a unique representation to each token. For instance, the E1 is the fixed presenter of the sentence's first word, "ye". The model is made up of many levels, each of which performs multi-headed attention on the output of the preceding layer, for example, mBERT has 12 layers. T1 is the last representation of the first token or word of every sentence in www.nature.com/scientificreports/ Fig. 4. The classification layer or softmax layer that has been added here. The classification layer has a dimension of K x H, where K is the number of classes (Positive, negative and neutral) and H is the size of the hidden state. Model Training and Fine-Tuning: The entire sentiment classification mBERT model has been trained in two phases, with the first phase involving the pre-training of the mBERT language model and the second phase involving the fine-tuning of the outmost classification layer.The Urdu mBERT has been pre-trained on the Urdu Wikipedia. The mBERT model has been fine-tuned using the training set of the proposed and UCSA datasets, which are Comprised with labelled user reviews. Especially, the fully connected classification layer has been trained in this way. During training, categorical cross-entropy was utilized as the loss function. Table 5 presents lists the hyper-parameters adopted for this research.
Evaluation measures. In this study, Urdu sentiment analysis text classification experiments have been performed to evaluate our proposed dataset by using a set of machine learning, rule-based and deep learning  www.nature.com/scientificreports/ algorithms. As a baseline algorithm for better assessment, we performed tertiary classifications experiment with 9312 reviews from our suggested UCSA-21 dataset. We depict four evaluation measures applied for evaluations of a bunch of machine learning, rule-based, and deep learning algorithms such as accuracy, precision, recall, and F1-measure.
where TN, TP, FN, and FP represent number of True Negative, True Positive, False Negative and False Positive respectively.

Results analysis
This section explains the results of various experiments that have been executed in this study, the usefulness of our proposed architecture for Urdu SA, and the discussion of revealed results. In the evaluation of various implemented machine learning, deep learning, and rule-based algorithms, it is observed that the mBERT algorithm perform better than all other models. Tables 6 and 7 presents the obtained results using various machine learning techniques with different features on our proposed UCSA-21 corpus. The results reveal that SVM performance is slightly better on the UCSA-21 dataset than other machine learning algorithms, with an accuracy of 72.71% using combination (1-2) features. The gained results clearly show that all the machine learning classifiers perform better with word feature combination (1-2) and unigram. On the other hand, obtained results indicating that the set of machine learning algorithms performance is not satisfiable with trigram and bigram word feature. RF gain 55.00 % accuracy using trigram features had the lowest accuracy of all machine learning classifiers. When compared to bigram and trigram word features, all machine learning classifiers perform better using unigram word features which is consistent with 50 .The outcomes of several machine learning methods using character gram features are represented in Table 7. Using the Char-3-gram feature, the findings demonstrated that NB and SVM outperformed all other machine learning classifiers with an accuracy of 68.29% and 67.50% respectively. on the other hand, LR had the poorest performance, with an accuracy of 58.40% when employing the char-5-gram feature. Table 8 presents the baseline results achieved using a rule-based approach to validate our proposed UCSA-21 dataset. The rule-based approach achieved an accuracy (64.20%), precision (60.50%), recall (68.09%), and F1 score (64.07. It is observed that the rule-based technique didn't achieve high scores in terms of accuracy as compared to machine learning and deep learning approaches. The lousy performance of the rule-based approach in this experiment is mere because of not considering the semantic information during the experiment; the experiment is only based on the terms in the lexicons database. One of the biggest flaws with rule-based algorithms is that it cannot distinguish humorous reviews with more positive words.The satirical reviews such as "MashaAllah se koy to rank milli ha na hamari cricket team ko. . . akhiri he sahi" translated as "By the grace of God, our cricket team got at least some rank. may that be last)" is a negative review which is wrongly classified as a positive review by rule-based approach.
Finally, this section contains the baseline results generated using many deep learning algorithms such as CNN-1D, LSTM,GRU, Bi-GRU, Bi-LSTM and our proposed model based on mBERT model. According to the results presented in Table 9, deep learning models outperforms machine learning and rule-based approach. The obtained results reveal that our proposed model fine-tuned based on mBERT with SoftMax supersedes all other deep learning models with accuracy, precision, recall, and F1 score of 77.61%, 76.15%, 78.25%, and 77.18% respectively. It is Observed that Bi-LSTM and Bi-GRU can be effective for Urdu sentiment analysis compared to other traditional machine learning, rule-based, and deep learning algorithms merely because Bi-LSTM and Bi-GRU can capture information from backward and forward ways. Bi-LSTM produces slightly better results because it understands context better than LSTM and CNN-1D. It is also observed that LSTM and CNN-1D achieves slightly better results with Attention (ATT)layer as compared Max-polling (MP) layer.
Using the UCSA corpus, Table 10 compares the results of our proposed mBERT model with those of other commonly used deep learning algorithms. The obtained results shows that mBERT with SoftMax outperform all other deep learning algorithms with accuracy, precision, recall, and F1 score of 82.50%, 81.35%, 81.65%, and 81.49% respectively.We did not apply traditional machine learning algorithms to validate UCSA corpus because in study 50 authors already set baseline results. The findings shows that deep learning and our proposed model comparatively perform better by using UCSA corpus, due to less number of classification classes. As mentioned above the UCSA corpus compromises with only two classes: Positive and Negative on the other hand our proposed UCSA-21 corpus comprises with additional neutral class. After evaluating the data, achieving highest performance on both datasets shows the effectiveness of our proposed model for Urdu sentiment analysis (Fig. 5).
The confusion matrix is a measure for assessing the validity of a classification. Figure 6 present the confusion matrix of our proposed mBERT by using UCSA-21 Urdu corpus. In Fig. 6, 78.10% of positive sentences are correctly classified as positive, while only 11.90% of positive reviews are incorrectly classified as negative, and 10.00% as neutral. Out of all reviews 78.40% of negative reviews are correctly identified as negative, while only 11.40% and 10.20% of negative reviews are incorrectly classified as neutral and positive respectively. Only www.nature.com/scientificreports/ 12.00% and 11.65% of neutral reviews are misclassified as negative and positive respectively, while 76.35 % of neutral reviews are accurately classified by our proposed model against UCSA-21 corpus. Similarly, Fig. 7 represents the confusion matrix of our proposed mBERT model using UCSA corpus which has only two classes: positive and Negative. Machine learning models, on average, contain less trainable parameters than deep neural networks, which explains why they train so quickly. Instead than employing semantic information, these classifiers define class boundaries based on the discriminative power of words in relation to their classes. Furthermore, SVM performs pretty well among all adopted machine learning approaches because it not only handles outliers significantly better than other machine learning algorithms by deriving maximum margin hyperplanes, However, it also supports the kernel technique, which allows for effective tuning of a number of hyper-parameters to reach optimal performance. In addition, SVM employs Hinge loss, which outperforms LR's log loss. Similarly, SVM's capacity to capture feature interactions to some extent makes it superior to NB, which typically treats features independently.
On the other hand, deep learning algorithms, not only automate the feature engineering process, but they are also significantly more capable of extracting hidden patterns than machine learning classifiers. Due to a lack of training data, machine learning approaches are invariably less successful than deep learning algorithms. This is exactly the situation with the hand-on Urdu sentiment analysis assignment, where proposed and customized deep learning approaches significantly outperform machine learning methodologies. Bi-LSTM and Bi-Gru are the adaptable deep learning approach that can capture information in both backward and forward directions. The proposed mBERT used BERT word vector representation which is highly effectiv for NLP tasks. Eventually this approach which is based on transformers and encoder-decoder based technology beats other deep learning, machine learning and rule-based models. Figure 5 compare the overall accuracy of three various approaches Table 6. Urdu sentiment analysis results using machine learning models with word n-gram features. www.nature.com/scientificreports/ and with proposed model used for Urdu sentiment analysis. The results reveals that the proposed mBERT model beats the deep learning, machine learning and rule-based algorithms. As previously said, the Urdu language has a morphological structure that is highly unique, exceedingly rich, and complex when compared to other resource-rich languages. Urdu is a blend of several languages, including Hindi, Arabic, Turkish, Persian, and Sanskrit, and contains loan words from these languages. These are the most common causes of algorithm misclassifications. Other reasons for incorrect classifications include the fact that the normalization of Urdu text is not yet perfect. To tokenize Urdu text, spaces between words must be removed/ inserted because the boundary between words is not visibly apparent. Similarly, in an Urdu sentence, the order of words can be changed but the sense/meaning stays the same, as in "Meeithay aam hain" and "Aam meeithay hain, " both of which have the same meaning "Mangos are sweet". Manual annotation of user reviews also one of the reasons for miss classification.   www.nature.com/scientificreports/ The primary purpose for using a set of machine learning algorithms with word and character n-gram features to establish baseline results against our proposed Urdu corpus. Our proposed dataset comprises with short and long type of user reviews that's why we used various deep learning algroithms such GRU and LSTM to investigate the performance of algroithms against Urdu text. GRU is typically used to categorize short sentences, whereas LSTM is thought to perform better versus long sentences because to its core structure. Similarly, BERT is currently one of the highest performing models for unsupervised pre-training. To address the Masked Language   www.nature.com/scientificreports/ Modelling objective, this model is based on the Transformer architecture and trained on a huge amount of unlabeled texts from Wikipedia. It shows outstanding performance on a variety of NLP tasks. Motivation using mBERT is to investigate its performance against resource deprived languages such as Urdu.
As previously stated, there is a paucity of research on using deep learning approaches to analyze Urdu sentiment. Only a few studies have been published in this field, and they all used various machine learning classifiers on a small dataset with limited domains and have only positive and negative classes. On the other hand, our dataset, contains more user reviews than earlier studies, and it includes several genres with three classifications classes: positive, negative, and neutral. Table 1 shows a summery and comparison of our research with previous research.

Conclusion and implications
A huge amount of data has been generated on social media platforms, which contains crucial information for various applications. As a result, sentiment analysis is critical for analyzing public perceptions of any product or service. We observed that in the Urdu language, majority of studies focused on language processing tasks, with only a few experiments done in the domain of Urdu sentiment analysis utilizing several classical machine learning methodologies relatively with a small data corpus with only two data classes. In contrast, we proposed a multi-class Urdu sentiment analysis dataset and used various machine and deep learning algorithms to create baseline results. Additionally, our proposed mBERT classifier, achieves F1 score of 81.49% and 77.18% using UCSA and UCSA-21 datasets respectively. This paper lays the path for more deep learning research into constructing language-independent models for languages with limited resources. Our findings reveal an essential insight: deep learning with pre-trained word embedding is a viable strategy for dealing with complicated and resource-poor languages like Urdu. In future, our plan is to use models such as GPT, GPT2 and GPT3 to improve the results. We believe that our publicly available dataset will serve as a baseline for sentiment analysis in Urdu.